Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stdlib] Add more utf-8 validation unit tests #3405

Conversation

gabrieldemarmiesse
Copy link
Contributor

Part of my work on utf-8 validation. The new algorithm is more complexe, thus to improve our trust in it, I added new unit tests.

@gabrieldemarmiesse gabrieldemarmiesse requested a review from a team as a code owner August 22, 2024 13:17
Signed-off-by: gabrieldemarmiesse <[email protected]>
Copy link
Collaborator

@JoeLoser JoeLoser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks for improving the tests! Do you or other people in the community you've been working with have a more sketched out broader/longer-term design for incorporating UTF-8 strings in the stdlib more generally speaking (e.g. should it be String type, a distinct type, how do we think about different ways of indexing (by byte, code point, etc.)?

assert_false(validate_utf8(sequence[]))


def test_combinaison_good_utf8_sequences():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def test_combinaison_good_utf8_sequences():
def test_combination_good_utf8_sequences():

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll fix this and the other similar typo internally when I import it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oupsy, my french is leaking

@JoeLoser
Copy link
Collaborator

!sync

@modularbot modularbot added the imported-internally Signals that a given pull request has been imported internally. label Aug 22, 2024
@gabrieldemarmiesse
Copy link
Contributor Author

About utf-8 and Strings longer term, there was quite a bit of discussion on the discord about this. (internal encoding + null terminator) and while there wasn't a consensus, there is a subset of ideas that everyone agreed on. I propose we do the non-controversial work first and then we'll be able to discuss the more controversial changes.

Non-controversial changes

Everyone agreed about the necessity of having utf-8 strings in one shape or another with a null terminator for compatibility with the OS. I propose we go in this direction first and make our String type an utf-8 string with a null terminator always present.

Controversial changes

Now for the rest of the ideas that we can discuss later, there was:

  • Adding a struct parameter to String Saying if the null terminator was present or not, and not present by default.
  • Adding a struct parameter to String indicating the internal encoding, notably to have an internal encoding compatible with python, which does not have utf-8 string. This parameter can also indicate that a string is ASCII only, type promotion can be applied when mixing strings with different internal representations.
  • Add a table for faster indexing in utf-8 strings.

Those three changes were controversial, and complexe. Especially with Python devs talking about changing the internal representation of str to use UTF-8, see faster-cpython/ideas#684 for their faster-cpython initiative.

So yeah the immediate goal would be to integrate utf-8 to our String struct. We'll see later for the rest.

@modularbot
Copy link
Collaborator

✅🟣 This contribution has been merged 🟣✅

Your pull request has been merged to the internal upstream Mojo sources. It will be reflected here in the Mojo repository on the nightly branch during the next Mojo nightly release, typically within the next 24-48 hours.

We use Copybara to merge external contributions, click here to learn more.

@modularbot modularbot added the merged-internally Indicates that this pull request has been merged internally label Aug 23, 2024
modularbot pushed a commit that referenced this pull request Aug 24, 2024
[External] [stdlib] Add more utf-8 validation unit tests

Part of my work on utf-8 validation. The new algorithm is more complex.
So, to ensure everything is working as expected, add some additional
unit tests.

Co-authored-by: Gabriel de Marmiesse <[email protected]>
Closes #3405
MODULAR_ORIG_COMMIT_REV_ID: 0dbb5c80b8326d6063f64f5bd998e509d72ecf67
@modularbot modularbot added the merged-externally Merged externally in public mojo repo label Aug 24, 2024
@modularbot
Copy link
Collaborator

Landed in b40ec35! Thank you for your contribution 🎉

@modularbot modularbot closed this Aug 24, 2024
msaelices pushed a commit to msaelices/mojo that referenced this pull request Sep 10, 2024
[External] [stdlib] Add more utf-8 validation unit tests

Part of my work on utf-8 validation. The new algorithm is more complex.
So, to ensure everything is working as expected, add some additional
unit tests.

Co-authored-by: Gabriel de Marmiesse <[email protected]>
Closes modularml#3405
MODULAR_ORIG_COMMIT_REV_ID: 0dbb5c80b8326d6063f64f5bd998e509d72ecf67

Signed-off-by: Manuel Saelices <[email protected]>
modularbot pushed a commit that referenced this pull request Sep 13, 2024
[External] [stdlib] Add more utf-8 validation unit tests

Part of my work on utf-8 validation. The new algorithm is more complex.
So, to ensure everything is working as expected, add some additional
unit tests.

Co-authored-by: Gabriel de Marmiesse <[email protected]>
Closes #3405
MODULAR_ORIG_COMMIT_REV_ID: 0dbb5c80b8326d6063f64f5bd998e509d72ecf67
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
imported-internally Signals that a given pull request has been imported internally. merged-externally Merged externally in public mojo repo merged-internally Indicates that this pull request has been merged internally
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants